Suppose you conduct a study where you enroll 200 patients with mild dementia symptoms, then

randomize them so that 100 receive an experimental drug intended for mild dementia symptoms, and

100 receive a placebo. You have the participants take their assigned product for six weeks, then you

record whether each participant felt that the product helped their dementia symptoms. You tabulate the

results in a fourfold table, like Figure 13-5.

FIGURE 13-5: Comparing a treatment to a placebo.

According to the data in Figure 13-5, 70 percent of participants taking the new drug reported that it helped their dementia symptoms, which is quite impressive until you see that 50 percent of participants who received the placebo also reported improvement. When patients report a therapeutic effect from a

placebo, it’s called the placebo effect, and it may come from a lot of different sources, including the

patient’s expectation of efficacy of the product. Nevertheless, if you conduct a Yates chi-square or

Fisher Exact test on the data (as described in Chapter 12) at α = 0.05, the results show treatment

assignment was statistically significantly associated with whether or not the participant reported a treatment effect (p < 0.05 by either test).
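
If you want to verify these results with software, here is a minimal sketch (not from the book) using Python's scipy library, which offers both tests. The counts come from the percentages quoted above: 70 of 100 participants on the drug and 50 of 100 on the placebo reported improvement.

# A minimal sketch of running the two tests described above on the fourfold
# table implied by the text (70/100 improved on drug, 50/100 on placebo).
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

#                 improved  not improved
table = np.array([[70, 30],    # drug group
                  [50, 50]])   # placebo group

# Yates chi-square test (correction=True applies the Yates continuity correction)
chi2, p_yates, dof, expected = chi2_contingency(table, correction=True)

# Fisher Exact test
odds_ratio, p_fisher = fisher_exact(table)

print(f"Yates chi-square: chi2 = {chi2:.2f}, p = {p_yates:.4f}")
print(f"Fisher Exact:     OR = {odds_ratio:.2f}, p = {p_fisher:.4f}")
# Both p values come out below alpha = 0.05, in line with the conclusion above.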

Looking at inter- and intra-rater reliability

Many measurements in epidemiologic research are obtained by the subjective judgment of humans.

Examples include the human interpretation of X-rays, CAT scans, ECG tracings, ultrasound images,

biopsy specimens, and audio and video recordings of the behavior of study participants in various

situations. Human researchers may generate quantitative measurements, such as determining the length

of a bone on an ultrasound image. Human researchers may also generate classifications, such as

determining the presence or absence of some atypical feature on an ECG tracing.

Humans who perform such determinations in studies are called raters because they are assigning

ratings, which are values or classifications that will be used in the study. For the measurements in your

study, it is important to know how consistent such ratings are among different raters engaged in rating

the same item. This is called inter-rater reliability. You will also be concerned with how

reproducible the ratings are if one rater were to rate the same item multiple times. This is called intra-rater reliability.

When considering the consistency of a binary rating (like yes or no) for the same item between two

raters, you can estimate inter-rater reliability by having each rater rate the same group of items.

Imagine two raters each rate the same 50 scans as yes or no according to whether each scan shows a tumor. The results are cross-tabulated in Figure 13-6.
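
To make this concrete, here is a minimal sketch (not the book's code, and not the data behind Figure 13-6) that cross-tabulates binary ratings from two hypothetical raters in Python and summarizes their agreement, both as raw percent agreement and as Cohen's kappa, a common chance-corrected measure of inter-rater agreement for a fourfold table like this.

# Hypothetical example: two raters each rate the same 50 scans as yes or no.
# The ratings are randomly generated purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rater_a = rng.choice(["yes", "no"], size=50)
# Rater B agrees with rater A about 80 percent of the time in this fake data
rater_b = np.where(rng.random(50) < 0.8, rater_a,
                   np.where(rater_a == "yes", "no", "yes"))

# Fourfold table of the two raters' calls (analogous to Figure 13-6)
xtab = pd.crosstab(rater_a, rater_b, rownames=["Rater A"], colnames=["Rater B"])
print(xtab)

# Observed agreement: proportion of scans where the raters gave the same call
p_observed = np.mean(rater_a == rater_b)

# Agreement expected by chance alone, from each rater's marginal proportions
p_yes_a, p_yes_b = np.mean(rater_a == "yes"), np.mean(rater_b == "yes")
p_expected = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)

# Cohen's kappa: how much the raters agree beyond chance
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"Observed agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")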